Abstract. For a model of molecular evolution to be useful for
phylogenetic inference, the topology of evolutionary trees must be 
identifiable. That is, from a joint distribution the model predicts, 
it must be possible to recover the tree parameter. 

We establish tree identifiability for a number of phylogenetic 
models, including a covarion model and a variety of mixture models 
with a limited number of classes. 

The proof is based on the introduction of a more general model, 
allowing more states at internal nodes of the tree than at leaves, 
and the study of the algebraic variety formed by the joint distribu- 
tions to which it gives rise. Tree identifiability is first established 
for this general model through the use of certain phylogenetic in- 
variants. 



1. Introduction 

In phylogenetics, probabilistic models of the evolution of biological 
sequences (DNA or proteins, for example) are used to infer evolutionary 
history. 

The parameters of such a model typically include such things as the 
topology of the rooted tree depicting the temporal ordering of speci- 
ation events, the elapsed time between these events, and the rates at 
which different types of substitutions (A ↔ C, A ↔ G, etc.) occur
between events. While any of these parameters might be of interest in 
a particular study, the tree topology is often the one of greatest interest 
(and one on which the very definition of the others depends). 

A basic question concerning any statistical model then is whether 
it is identifiable: Given a distribution of observations that the model 
predicts, is it theoretically possible to recover the parameters of the 
model? Understanding what parameters are identifiable for a model is 
crucial to understanding what we may reasonably hope to infer from 
data. 

In particular, identifiability of the tree topology is essential for any
model that is to be used in inferring evolutionary histories from data. 
If a tree is not uniquely determined by an expected joint distribution, 
then one has no hope of using the model to infer trees well from data. 
Indeed, proofs of the statistical consistency of an inference method such 
as maximum likelihood begin by establishing identifiability of parame- 
ters. 

Identifiability of tree topologies has been investigated for a number
of models. See, for example, [Ste94], [SSH94], [Cha96], [WS97], [Baa98],
[SHP98], [Rog01] for both positive and negative results. However, much
remains to be done. As pointed out in [BGP05], for mixture models
that allow several classes of sites in sequences to evolve at different
rates, "[n]othing has been proved in the general context yet." In
fact, most proofs of tree identifiability for various models have been
based on notions of phylogenetic distances and the four-point condition
of Buneman [Bun71], and for general mixture models no such
distance is known. (See also [EW04] for related work illustrating
non-identifiability of edge lengths for mixture models.)

In this paper we establish a general identifiability result applicable 
to many mixture models, albeit with a limited number of classes. As 
further motivation for our work, however, we choose to highlight the 
covarion model, for which the question of identifiability of trees has 
also been open. 

A covarion model of character evolution describes characters with 
states that are only partially observable. Such a model can be viewed as 
a type of hidden Markov model on a tree, where not only are character 
states at all internal nodes of the tree hidden, as in simpler phylogenetic 
models, but even at the leaves full information is not available. 

For instance, in the covarion model of Tuffley and Steel [TS98] for
the evolution of DNA sequences, each site in the sequences is a
character. As evolution proceeds over a tree, this character is in one of the
eight states $A_{on}$, $A_{off}$, $G_{on}$, $G_{off}$, $C_{on}$, $C_{off}$, $T_{on}$, $T_{off}$, where the
subscript "on" denotes the site is currently free to undergo base substitution,
and "off" denotes that it is currently invariable. The off-states indicate
functional or other biological constraints temporarily preventing sub- 
stitutions. Importantly, as evolution proceeds over the tree, sites may 
pass from on-states to off-states and vice versa. However, when we 
observe sequences from currently extant taxa, we can only observe A, 
C, G, or T; we obtain no information as to whether a site is currently 
on or off. 

That constraints to nucleotide substitution might change over a tree
is a biologically plausible hypothesis that makes covarion models at- 
tractive. A covarion model might be viewed as a type of rate variation 
model, as several characters described by the same model may be in 
on- and off-states for different durations, and thus undergo different 
amounts of substitution. However, the 'switching' between on- and off- 
states, allowing a character to behave differently in one part of a tree 
from another, is a crucial distinction from standard approaches to rate 
variation. Of course more elaborate covarion models, with more than 
the two hidden rate-classes of the example above, are easily devised. 
For instance there might be off-, slow-, and fast-states, with the oppor- 
tunity for a character to pass in and out of each as evolution proceeds 
over the tree. 

Though the covarion idea was originally proposed by Fitch and Markowitz
[FM70], it was not until the work of Tuffley and Steel [TS98] that a
covarion model was mathematically formalized and the first steps were
taken in its theoretical analysis. More recently, Galtier [Gal01]
implemented a maximum likelihood inference package using a covarion
model, and reported improved fits to data over standard rate-variation
models. See also [PMCH01] for a more thorough overview and arguments in
support of the use of such models.

Although the covarion model is appealing for biological reasons, it is 
less well understood theoretically. For instance, many basic questions 
of identifiability of model parameters have been open. Indeed, much 
of [TS98] is focused on showing that in some circumstances a covarion
model is distinguishable from a rate-variation model, and that some 
features of the tree topology are identifiable provided one has prior 
knowledge of some clades. 

Motivated by the covarion model, in this paper we first prove a re- 
sult on identifiability of tree topologies for a more general phylogenetic 
model. The model is introduced in Section 2 and the result proved in
Section 4. We also show how the result specializes to establish
identifiability of the tree topology for more specific models of greater direct
interest for applications, including the covarion model and certain
mixture models. In Section 3 we describe some of these models, and in
Section 5 we apply the general result to deduce tree identifiability under
certain assumptions.

Actually, our results require some mild restrictions on model param- 
eters — it is better to say that tree topology is identifiable for generic 
parameters. Informally this means if parameters are chosen "at ran- 
dom," then the topology can be identified. More precisely, our use of 
the word "generic" is as in algebraic geometry: We say a property holds
for generic parameters if it holds for all parameters off of a proper sub- 
variety of the parameter space. By "subvariety" we might mean either 
an algebraic subvariety, defined by the vanishing of a set of multivari- 
able polynomials, or, more generally, an analytic subvariety, defined 
by the vanishing of analytic functions. Since in either circumstance a 
proper subvariety is a closed set of lower dimension than the ambient 
space, generic parameters form an open, dense subset of the parameter 
space. 

As the last paragraph hints, our approach throughout is algebraic, 
and provides a good illustration of the value of an algebraic viewpoint 
for statistical models. Within phylogenetics, this approach began with 
the introduction of the idea of phylogenetic invariants by Cavender 
and Felsenstein [CF87] and Lake [Lak87]. Notable contributions for
group-based models appeared in [ES93] and [SSE93], building on an
idea first introduced in [Hen89]. It has been pursued in a number
of recent works focused on phylogenetics, such as [CHHP00], [AR03],
k^K.M()3[ IXHM lERSSQj^. EmF\ IMUM IXRoHH KX^M IEn05]. 
and more broadly for biological application in the recent volume [PS05].

Though our emphasis here is on theory, practical methods of identify- 
ing tree topologies from data are also needed. For instance, the notions 
of phylogenetic distance that play a key role in theoretical identifica- 
tion of tree topologies for simpler models also provide useful tools for 
tree inference. Whether one wishes to base inference on a distance- 
based method, or merely view such methods as fast heuristic means of 
finding good candidate trees to begin a more elaborate search of tree 
space, the value of distances is clear. For models where no distance 
formulas are known, the explicit polynomials our results yield, whose 
vanishing on a joint distribution identifies the tree topology, might play 
a similar role. It will be interesting to see if these polynomials might 
be exploited for practical inference, either heuristically or on a more 
solid statistical basis. 

Finally, we thank Cecile Ane for first suggesting to us that the co- 
varion model might be tractably studied by our methods. 

2. The $(\lambda, \kappa)$-state general Markov model

In this section we introduce a phylogenetic model which allows more 
states at internal nodes of the tree than at leaves. Though motivated 
by the covarion model of [TS98], our model is much more general. We
emphasize that we introduce this model not because we feel it precisely
captures any biological phenomena, but rather because its generality
encompasses a variety of models of more direct biological interest. It 
will allow us to make the key algebraic ideas in our subsequent ar- 
guments on identifiability clear, and obtain results which can then be 
applied to more specialized models. 

Throughout, suppose $T$ is a trivalent (i.e., binary) tree. Choosing
some internal vertex $r$ as a root, denote the resulting rooted tree by
$T^r$. Corresponding to each leaf of the tree we have an observed random
variable with state space $[\kappa] = \{1, 2, \ldots, \kappa\}$, while for each internal
vertex we have a hidden (unobserved) variable with state space $[\lambda]$. The
states of observed variables might represent the bases at a site in DNA
sequences ($\kappa = 4$) from extant taxa, while states of hidden variables
might represent ancestral bases together with additional features, such
as how rapidly a site currently undergoes mutation, or even whether it
is currently invariable. With this interpretation in mind, we will always
assume $\lambda \ge \kappa$.

A $\lambda$-element row vector $\pi_r$ describes the probability distribution of
the states for the root variable. For each internal edge $e$ of the tree,
with $e$ directed away from the root, a $\lambda \times \lambda$ Markov matrix $M_e$ describes
transition probabilities. For each pendant edge $e$, a $\lambda \times \kappa$ Markov matrix
$M_e$ describes transition probabilities. Thus $M_e(i,j)$ is the conditional
probability of state $j$ at the end of $e$ given state $i$ at its start. Stochastic
assumptions ensure that all entries are non-negative, the entries of $\pi_r$
sum to 1, and the entries in any row of any of the $M_e$ sum to 1. With
no further restrictions imposed on either the root distribution or the
Markov matrices, we call this the $(\lambda, \kappa)$-state general Markov model on
the rooted tree $T^r$.

In the case $\lambda = \kappa$ this model is the usual general Markov model with
$\kappa$ states. We are therefore particularly interested in cases where $\lambda > \kappa$.
For instance, a generalization of the on-off covarion model described in
the introduction has $\lambda = 2\kappa$.

For a fixed $n$-leaf rooted tree $T^r$, we may make some choice of entries
in $\pi_r$ and each row of the $M_e$ to view as independent variables, using
the condition that rows sum to 1 to determine the remaining entry.
Since $T^r$ has $n$ pendant edges and $n - 3$ internal edges, the stochastic
parameter space $S$ for this model can thus be identified with a subset
of $[0, 1]^M$ where $M = (\lambda - 1) + n\lambda(\kappa - 1) + (n - 3)\lambda(\lambda - 1)$.
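
As a quick check on the parameter count, the following minimal Python sketch tallies the three terms separately: the root distribution, the pendant edges, and the internal edges. The function name `gm_parameter_count` and its argument names are hypothetical, introduced here only for illustration.

```python
def gm_parameter_count(n_leaves, lam, kappa):
    """Free stochastic parameters of the (lam, kappa)-state model on an n-leaf trivalent tree.

    Hypothetical helper: each probability vector of length r contributes r - 1
    free entries, since its last entry is fixed by the sum-to-one condition.
    """
    if n_leaves < 3:
        raise ValueError("a trivalent tree needs at least 3 leaves")
    if lam < kappa:
        raise ValueError("the model assumes lam >= kappa")
    root = lam - 1                                 # root distribution pi_r
    pendant = n_leaves * lam * (kappa - 1)         # n pendant edges, lam x kappa matrices
    internal = (n_leaves - 3) * lam * (lam - 1)    # n - 3 internal edges, lam x lam matrices
    return root + pendant + internal


# Quartet tree, 4 observable bases, with and without a hidden on/off class.
print(gm_parameter_count(4, 4, 4))
print(gm_parameter_count(4, 8, 4))
```

When `lam == kappa` the count reduces to $(\kappa - 1) + (2n - 3)\kappa(\kappa - 1)$, the familiar figure for the general Markov model.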

The probabilities of observing each of the possible patterns (i.e.,
assignments of states) of leaf variables can then be given as polynomial
expressions in the parameters. That is, there is a polynomial map, the
parameterization map,

$$\phi_{T^r} : S \to [0, 1]^{\kappa^n},$$

which gives the joint distribution of observed states at the leaves of
$T^r$ as a function of the parameters. We extend this to a polynomial map
$\mathbb{C}^M \to \mathbb{C}^{\kappa^n}$, which we also denote by $\phi_{T^r}$, and refer to $\mathbb{C}^M$ as the complex
parameter space.

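The phylogenetic variety for the $(\lambda, \kappa)$-state model on $T^r$ is the
algebraic variety defined as

$$V_{T^r, \lambda, \kappa} = \overline{\phi_{T^r}(\mathbb{C}^M)} \subseteq \mathbb{C}^{\kappa^n},$$

where the bar denotes the (Zariski and standard) topological closure
in $\mathbb{C}^{\kappa^n}$.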

Lemma 1. Let $T$ be an $n$-leaf trivalent tree, and $r_1$, $r_2$ any two internal
nodes. Then $V_{T^{r_1}, \lambda, \kappa} = V_{T^{r_2}, \lambda, \kappa}$, so this variety may be denoted by
$V_{T, \lambda, \kappa}$.

Proof. This is proved similarly to the corresponding result for the
general Markov model, as in [SSH94] or [AR03]. □



3. Algebraic and analytic submodels of the $(\lambda, \kappa)$-state general Markov model

To further motivate our introduction of the $(\lambda, \kappa)$-state general Markov
model, we note that many models of molecular evolution can be viewed
as submodels of it, in that they simply place more restrictive
assumptions on the allowable parameter values. In this section we first indicate
some of these submodels of interest.

We also introduce the idea of an analytic $(\lambda, \kappa)$-state model, which
will allow us not only to deal with parameterization maps in which joint
distributions are expressed by polynomial formulas in the parameters,
but a wider class that encompasses the 'rate matrix' models that are
so commonly used in applications.

Some specific examples of models that can be viewed as submodels
of the general $(\lambda, \kappa)$-state model include the following:

(1) GM: As already stated, the $\kappa$-state general Markov model
results from $\lambda = \kappa$.

(2) GM+I: This model allows two classes of sites in sequences; one
class mutates according to the general Markov model, and another
is held invariable. A parameter $f$ denotes the proportion
of sites in the first class, with $1 - f$ in the second. If
the root distribution vectors for the two classes are $\pi_1$ and $\pi_2$,
let $\pi_r = (f\pi_1, (1 - f)\pi_2)$. For an internal edge $e$, if $N_e$ is the
$\kappa \times \kappa$ matrix describing transition probabilities for the first class
along that edge, let

$$M_e = \begin{pmatrix} N_e & 0 \\ 0 & I \end{pmatrix},$$

a $2\kappa \times 2\kappa$ matrix. For pendant edges $e$, let

$$M_e = \begin{pmatrix} N_e \\ I \end{pmatrix},$$

a $2\kappa \times \kappa$ matrix. Thus the model results from simply restricting
parameters in the general $(2\kappa, \kappa)$-state model so that all Markov
matrices have a particular form.

The restricted parameter space can be identified with $\mathbb{C}^M$,
where $M = (2\kappa - 1) + (2n - 3)\kappa(\kappa - 1)$. Note the parameterization
map giving joint distributions as a function of parameters
for this model is still a polynomial one, given by restriction of
the map for the general $(2\kappa, \kappa)$-state model to a smaller domain.

(3) GM+GM+···+GM: We consider $m$ classes of sites, each evolving
independently according to a different GM model. To view
this as a submodel of our more general model, for each internal
edge of the tree we create an $m\kappa \times m\kappa$ block-diagonal Markov
matrix, with each of the $m$ $\kappa \times \kappa$ blocks giving transition
probabilities for a particular class. On pendant edges we 'stack' the
blocks, giving $m\kappa \times \kappa$ matrices. The root distribution is similarly
obtained by concatenating the root distributions for each
class, weighted by additional parameters describing the relative
frequencies of the classes. Thus we are dealing with a restriction
of the $(m\kappa, \kappa)$-state model, with parameter space identified
with $\mathbb{C}^M$ where $M = (m\kappa - 1) + (2n - 3)m\kappa(\kappa - 1)$. Again, the
parameterization map for this model is polynomial.

(4) Other algebraic models: In the previous examples, we can
replace an occurrence of GM by a submodel, such as the Jukes-Cantor,
Kimura 2-parameter, Kimura 3-parameter, or Strand
Symmetric, defined by further restriction of parameters. Allowing
arbitrary matrices of these types on each edge (so that we do
not assume a common rate matrix), we again have a polynomial
parameterization map with domain $\mathbb{C}^M$ for some $M$.

The previous examples all lie fully within an algebraic framework,
but in fact many of the models used for inference in current applications
are not of this sort. Rate matrix models assume more commonality to
the substitution process on the various edges of the tree. Typically,
one fixes a rate matrix $Q$ with non-negative off-diagonal entries and
rows summing to 0. Then each edge $e$ of the tree is assigned a scalar
edge length $t_e$, and the Markov matrix $M_e = \exp(Q t_e)$ gives transition
probabilities for that edge.

(5) GTR: A submodel of the general $(\kappa, \kappa)$-state model, the general
time-reversible model assumes a root distribution $\pi_r$ and rate
matrix $Q$ such that $\pi_r Q = 0$ and $\mathrm{diag}(\pi_r) Q$ is symmetric. Pairs
$\pi_r, Q$ with these properties can be parameterized by $(\kappa - 1) +
\kappa(\kappa - 1)/2$ scalars. Since we may normalize so one edge length
is 1, the parameter space for the full model is of dimension $M =
(\kappa - 1) + \kappa(\kappa - 1)/2 + (2n - 4)$. However, the parameterization
map giving joint distributions is not polynomial, as it involves
a composition of matrix exponentials with the general Markov
parameterization. Nonetheless, it is an analytic map.

(6) GTR+rate-classes: Let $\pi_r, Q$ be as in the GTR model.
Assuming $m$ different classes of sites, we assign each a relative
frequency $f_i$ and a scalar rate parameter $\lambda_i$, with $\lambda_1 = 1$. Then
the $i$th class undergoes substitutions on an edge $e$ according
to $N_e = \exp(Q \lambda_i t_e)$. For internal edges of the tree we embed
the $N_e$ as blocks in a larger block-diagonal matrix $M_e$, while
for pendant edges we stack them, obtaining an expression of
this model as a submodel of the general $(m\kappa, \kappa)$-state model.
The parameter space is of dimension $M = (\kappa - 1) + \kappa(\kappa -
1)/2 + 2(m - 1) + (2n - 4)$. Again we have an analytic, but not
polynomial, parameterization map.

Note that our formulation requires a finite number of rate
classes. While current literature often refers to a continuous
distribution of rates (usually with a $\Gamma$ distribution), in practice
inference is always done with a discretization of the distribution,
producing finitely many rate classes as here.

(7) GTR+I+rate-classes: This model is simply the last, with the
further assumption $\lambda_2 = 0$.

(8) Covarion: As formulated in [TS98], the Tuffley-Steel covarion
model hypothesizes a common $2\kappa \times 2\kappa$ rate matrix $Q$ of a
particular form. Let $\pi$, $R$ be a root distribution and rate matrix
for the $\kappa$-state GTR model. Let $s_1, s_2 > 0$, and set
$\sigma_1 = \frac{s_2}{s_1 + s_2}$, $\sigma_2 = \frac{s_1}{s_1 + s_2}$. Then

$$Q = \begin{pmatrix} R - s_1 I & s_1 I \\ s_2 I & -s_2 I \end{pmatrix}$$

is the rate matrix for a $2\kappa$-state time-reversible process,
stationary on the root distribution vector $\pi_r = (\sigma_1 \pi, \sigma_2 \pi)$. In the
language of the introduction, the first $\kappa$ states represent bases
that are on, and the last $\kappa$ ones that are off.

The covarion model then associates to each internal edge $e$ of
the tree the Markov matrix $M_e = \exp(Q t_e)$, and to a pendant
edge $e$ the matrix $M_e = \exp(Q t_e)(I_{\kappa \times \kappa}\ I_{\kappa \times \kappa})^T$. This model is
therefore a submodel of the $(2\kappa, \kappa)$-state model.

Since the parameters for the covarion model can be viewed
as $(\pi, R, s_1, s_2, \{t_e\})$, the parameter space is of dimension $M =
(\kappa - 1) + \kappa(\kappa - 1)/2 + 2 + (2n - 4)$. As with all rate matrix models,
the parameterization map is analytic, though not polynomial.

(9) The model referred to as the SSRV in [Gal01] generalizes the
covarion model to $m$ rate classes, sharing the same rate matrix
$R$, with switching allowed between the classes. It can similarly
be seen to be a submodel of the $(m\kappa, \kappa)$-state model, with an
analytic parameterization map.

Of course many more variations are possible. The reader familiar 
with other basic models will have no trouble modifying the examples 
above, adding rate classes if desired, or even mixing several different 
models as separate classes. 
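
To make the block structure of the covarion rate matrix in example (8) concrete, here is a minimal Python sketch that assembles $Q$ from a reversible pair $(\pi, R)$ and switching rates $s_1, s_2$, and then checks the three properties the text relies on: rows summing to 0, stationarity on $\pi_r$, and time-reversibility. The helper names and the two-state numerical values are assumptions chosen only for illustration; exact rational arithmetic is used so the checks are equalities rather than tolerances.

```python
from fractions import Fraction as F


def covarion_rate_matrix(R, s1, s2):
    """Assemble Q = [[R - s1*I, s1*I], [s2*I, -s2*I]] as a list of rows."""
    k = len(R)
    Q = [[F(0)] * (2 * k) for _ in range(2 * k)]
    for i in range(k):
        for j in range(k):
            # on -> on block: substitutions by R, minus the rate of switching off
            Q[i][j] = F(R[i][j]) - (F(s1) if i == j else 0)
        Q[i][k + i] = F(s1)        # on -> off, base unchanged
        Q[k + i][i] = F(s2)        # off -> on, base unchanged
        Q[k + i][k + i] = -F(s2)   # off states undergo no substitutions
    return Q


def covarion_root(pi, s1, s2):
    """Root distribution (sigma1*pi, sigma2*pi), sigma1 = s2/(s1+s2), sigma2 = s1/(s1+s2)."""
    sigma1 = F(s2) / (F(s1) + F(s2))
    sigma2 = F(s1) / (F(s1) + F(s2))
    return [sigma1 * p for p in pi] + [sigma2 * p for p in pi]


def is_stationary(pi_r, Q):
    """Check pi_r Q = 0."""
    n = len(Q)
    return all(sum(pi_r[i] * Q[i][j] for i in range(n)) == 0 for j in range(n))


def is_reversible(pi_r, Q):
    """Check that diag(pi_r) Q is symmetric."""
    n = len(Q)
    return all(pi_r[i] * Q[i][j] == pi_r[j] * Q[j][i]
               for i in range(n) for j in range(n))


# Hypothetical 2-state reversible pair: pi R = 0 and diag(pi) R symmetric.
pi = [F(1, 3), F(2, 3)]
R = [[-2, 2], [1, -1]]
Q = covarion_rate_matrix(R, s1=1, s2=3)
pi_r = covarion_root(pi, s1=1, s2=3)
print(all(sum(row) == 0 for row in Q), is_stationary(pi_r, Q), is_reversible(pi_r, Q))
```

The Markov matrices $\exp(Q t_e)$ themselves are not computed here, since the point is only the algebraic structure of $Q$ and $\pi_r$.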

To formalize a notion of submodel of the $(\lambda, \kappa)$-state model that
encompasses the above and other examples, we introduce some new
terminology.

By a submodel of the $(\lambda, \kappa)$-state general Markov model on a tree
we mean a restriction of parameters to a subset of the full
parameter space $\mathbb{C}^M$. Suppose the set of $(\lambda, \kappa)$-state general Markov model
parameters $s \in \mathbb{C}^M$ allowed in the submodel is $\psi(U)$, the image under
some analytic map $\psi : U \to \mathbb{C}^M$ of an open set $U \subseteq \mathbb{R}^L$. Then we say
the submodel is an analytic $(\lambda, \kappa)$-state model with parameter space $U$
and Markov map $\psi$. The parameterization map for the analytic model
is then $\phi_{T^r} \circ \psi$, where $\phi_{T^r}$ is the parameterization map for the general
$(\lambda, \kappa)$-state model:

$$U \xrightarrow{\ \psi\ } \mathbb{C}^M \xrightarrow{\ \phi_{T^r}\ } \mathbb{C}^{\kappa^n}.$$

Thus, for instance, the covarion model is an analytic $(2\kappa, \kappa)$-state
model, as is the GM+I model. Note that algebraic submodels, with
polynomial Markov maps, are included among the analytic ones.
Analyticity of the Markov map will be important for our arguments in
Section 5.

While all of the enumerated models above are analytic $(\lambda, \kappa)$-state
models, they in fact have additional structure in common. First note
that in each $\lambda = m\kappa$ for some $m$. Moreover, the set of states $[\lambda]$ at each
internal node is naturally identified with the set $[\kappa] \times [m]$. Under this
identification, if we refer to the observable states in $[\kappa]$ as 'bases', then
a state $(i, j)$ represents 'base $i$' and 'class $j$'. Here 'class' might refer to
'rate class', or some other characteristic, such as the on/off feature in
the covarion model. Finally, in all of these models the Markov matrices
on pendant edges of the tree have a form

$$M = \tilde{M} (I\ I\ \cdots\ I)^T,$$

where $\tilde{M}$ is an $m\kappa \times m\kappa$ Markov matrix of the sort allowed on internal
edges, and $(I\ I\ \cdots\ I)^T$ stacks $m$ copies of the $\kappa \times \kappa$ identity.
Essentially this means the model hides all class information,
so only bases are observable at leaves. We refer to such an analytic
$(m\kappa, \kappa)$-state model as an analytic $\kappa$-base, $m$-class model.



4. Identifiability of the tree topology for the general $(\lambda, \kappa)$-state model

Returning to consideration of the $(\lambda, \kappa)$-state general Markov model,
in this section we establish our main technical result on generic
identifiability of tree topologies.

We first consider identifiability of the tree topology from a joint
distribution of states at the leaves in the case of a 4-leaf tree. Let the
three possible trivalent trees with leaves $a, b, c, d$ be denoted by

$$T_1 = T_{ab|cd}, \quad T_2 = T_{ac|bd}, \quad T_3 = T_{ad|bc},$$

where the subscript $uv|wx$ denotes that leaves $u, v$ are adjacent to a
common internal node, as are leaves $w, x$.

Focusing on $T_1$, denote the internal nodes by $r$, $f$, so that the root $r$
is adjacent to the leaves $a$ and $b$. For $s \in \mathbb{C}^M$, the complex parameter
space, let the corresponding vector and matrix parameters (with rows
summing to 1) be $\pi_r \in \mathbb{C}^\lambda$, $M_{rf} \in M_{\lambda \times \lambda}(\mathbb{C})$, and
$M_{ra}, M_{rb}, M_{fc}, M_{fd} \in M_{\lambda \times \kappa}(\mathbb{C})$.

Then $P = \phi_{T_1}(s)$ can be expressed as a $\kappa \times \kappa \times \kappa \times \kappa$ tensor whose
entries $P(i,j,k,l)$ give the 'probability' of observing states $i,j,k,l$ at
leaves $a, b, c, d$, respectively, and are given by the following formula:

Let $A = \mathrm{diag}(\pi_r) M_{rf}$, a $\lambda \times \lambda$ matrix. Then define a $\lambda \times \lambda \times \lambda \times \lambda$
tensor $Q$ by

$$Q(i,j,k,l) = \begin{cases} A(i,k) & \text{if } i = j \text{ and } k = l, \\ 0 & \text{otherwise.} \end{cases}$$

Finally, let each of the $M_{ra}, M_{rb}, M_{fc}, M_{fd}$ act in the consecutive indices
of $Q$ to yield $P$, i.e.,

$$P(i,j,k,l) = \sum_{i',j',k',l'=1}^{\lambda} Q(i',j',k',l')\, M_{ra}(i',i)\, M_{rb}(j',j)\, M_{fc}(k',k)\, M_{fd}(l',l). \tag{1}$$

To motivate Equation (1), and our subsequent arguments, one should
think of the matrix $A$ as expressing the joint distribution of states at
the vertices $r$ and $f$. The tensor $Q$ then represents the joint distribution
for a $(\lambda, \lambda)$-state model on a quartet tree where no state changes occur
along the pendant edges (i.e., the Markov matrices on these edges are
$I$), as illustrated at the left in Figure 1. The model producing $P$
'extends' these terminal edges, placing the $\lambda \times \kappa$ Markov matrices on
the extensions, as shown on the right.



Figure 1. Models underlying the tensors $Q$ and $P$.
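
Equation (1) can be evaluated directly for small cases. The following minimal Python sketch does so with exact rational arithmetic, using the fact that $Q$ vanishes unless $i' = j'$ and $k' = l'$, so the four-fold sum collapses to a double sum over the states at $r$ and $f$. The function name `quartet_distribution` and all numerical parameter values are hypothetical, chosen only for illustration; states are indexed from 0 rather than 1.

```python
from fractions import Fraction as F
from itertools import product


def quartet_distribution(pi_r, M_rf, M_ra, M_rb, M_fc, M_fd):
    """Pattern probabilities P(i, j, k, l) on the quartet tree T_ab|cd via Equation (1)."""
    lam = len(pi_r)            # number of hidden states
    kappa = len(M_ra[0])       # number of observable states
    # A = diag(pi_r) M_rf: joint distribution of states at the two internal nodes
    A = [[pi_r[u] * M_rf[u][v] for v in range(lam)] for u in range(lam)]
    P = {}
    for i, j, k, l in product(range(kappa), repeat=4):
        # Q(i', j', k', l') = A(i', k') when i' = j' and k' = l', and 0 otherwise
        P[i, j, k, l] = sum(
            A[u][v] * M_ra[u][i] * M_rb[u][j] * M_fc[v][k] * M_fd[v][l]
            for u in range(lam) for v in range(lam))
    return P


# Hypothetical stochastic parameters with lam = 3 hidden and kappa = 2 observed states.
pi_r = [F(1, 2), F(1, 3), F(1, 6)]
M_rf = [[F(1, 2), F(1, 4), F(1, 4)],
        [F(1, 5), F(3, 5), F(1, 5)],
        [F(1, 3), F(1, 3), F(1, 3)]]
pendant = [[F(9, 10), F(1, 10)],
           [F(1, 4), F(3, 4)],
           [F(1, 2), F(1, 2)]]
P = quartet_distribution(pi_r, M_rf, pendant, pendant, pendant, pendant)
total = sum(P.values())
print(len(P), total)
```

For stochastic parameters the entries are non-negative and sum to 1, as a joint distribution must; for general complex parameters the same formula applies but the entries are only 'probabilities' in the formal sense used in the text.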

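Importantly, Equation (1) can be re-expressed several ways. The
first of these naturally reflects the topology of $T_1$. Let $\mathrm{Flat}_{ab|cd}(Q)$ be
the $\lambda^2 \times \lambda^2$ matrix with entries

$$\mathrm{Flat}_{ab|cd}(Q)((i,j),(k,l)) = Q(i,j,k,l),$$

where each index runs through $[\lambda]^2$. Similarly let $\mathrm{Flat}_{ab|cd}(P)$ be the
$\kappa^2 \times \kappa^2$ matrix, with indices in $[\kappa]^2$, given by

$$\mathrm{Flat}_{ab|cd}(P)((i,j),(k,l)) = P(i,j,k,l).$$

Let $N_{ab} = M_{ra} \otimes M_{rb}$ and $N_{cd} = M_{fc} \otimes M_{fd}$, where '$\otimes$' denotes the
Kronecker product of matrices. Thus $N_{ab}$ and $N_{cd}$ are $\lambda^2 \times \kappa^2$ matrices
where, for instance, $N_{ab}((i,j),(k,l)) = M_{ra}(i,k) M_{rb}(j,l)$. Then we have

$$\mathrm{Flat}_{ab|cd}(P) = N_{ab}^T\, \mathrm{Flat}_{ab|cd}(Q)\, N_{cd}. \tag{2}$$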

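E. S. ALLMAN AND J. A. RHODES

Alternatively, we have other expressions involving flattenings which
are less natural with respect to the topology of $T_1$. Let $\mathrm{Flat}_{ac|bd}(Q)$ be
the $\lambda^2 \times \lambda^2$ matrix with entries

$$\mathrm{Flat}_{ac|bd}(Q)((i,j),(k,l)) = Q(i,k,j,l).$$

Similarly let $\mathrm{Flat}_{ac|bd}(P)$ be the $\kappa^2 \times \kappa^2$ matrix with entries

$$\mathrm{Flat}_{ac|bd}(P)((i,j),(k,l)) = P(i,k,j,l).$$

Let $N_{ac} = M_{ra} \otimes M_{fc}$ and $N_{bd} = M_{rb} \otimes M_{fd}$. Then we have

$$\mathrm{Flat}_{ac|bd}(P) = N_{ac}^T\, \mathrm{Flat}_{ac|bd}(Q)\, N_{bd}. \tag{3}$$

A third such expression, obtained in a similar way, is

$$\mathrm{Flat}_{ad|bc}(P) = N_{ad}^T\, \mathrm{Flat}_{ad|bc}(Q)\, N_{bc}. \tag{4}$$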

The key observation underlying our proof of the identifiability of the
tree topology for the $(\lambda, \kappa)$-state model (Theorem 2 and Corollary 3
below) is that, for generic parameter choices, the ranks of the matrices
$\mathrm{Flat}_{ab|cd}(Q)$, $\mathrm{Flat}_{ac|bd}(Q)$, $\mathrm{Flat}_{ad|bc}(Q)$ are rather different (at most
$\lambda$ for the first, but $\lambda^2$ for the other two), and this
affects the ranks of the three corresponding flattenings of $P$. This
will lead to explicit polynomials (i.e., phylogenetic invariants) whose
vanishing can be used to identify the tree topology from which $P$ arises,
for generic parameters.
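
The rank contrast between the three flattenings of $P$ can be seen in a small exact computation. The sketch below is illustrative only: it takes $\kappa = 3$ bases and $m = 2$ hidden classes (so $\lambda = 6 < \kappa^2 = 9$), uses pendant matrices that copy the base and forget the class, and picks a hypothetical internal-edge matrix with all positive entries. The function names `quartet_tensor`, `flatten`, and `rank` are assumptions introduced here, not part of any library.

```python
from fractions import Fraction as F
from itertools import product


def quartet_tensor(pi_r, M_rf, pendants, kappa):
    """Pattern probabilities on T_ab|cd; Q is nonzero only when i'=j' and k'=l'."""
    lam = len(pi_r)
    M_ra, M_rb, M_fc, M_fd = pendants
    A = [[pi_r[u] * M_rf[u][v] for v in range(lam)] for u in range(lam)]
    return {
        (i, j, k, l): sum(
            A[u][v] * M_ra[u][i] * M_rb[u][j] * M_fc[v][k] * M_fd[v][l]
            for u in range(lam) for v in range(lam))
        for i, j, k, l in product(range(kappa), repeat=4)
    }


def flatten(P, kappa, split):
    """kappa^2 x kappa^2 flattening of P for the split ab|cd, ac|bd, or ad|bc."""
    pairs = list(product(range(kappa), repeat=2))

    def entry(r, c):
        if split == "ab|cd":   # rows (a, b), columns (c, d)
            return P[r[0], r[1], c[0], c[1]]
        if split == "ac|bd":   # rows (a, c), columns (b, d)
            return P[r[0], c[0], r[1], c[1]]
        if split == "ad|bc":   # rows (a, d), columns (b, c)
            return P[r[0], c[0], c[1], r[1]]
        raise ValueError(split)

    return [[entry(r, c) for c in pairs] for r in pairs]


def rank(rows):
    """Exact rank of a matrix of Fractions by Gaussian elimination."""
    m = [list(r) for r in rows]
    rk = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rk, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rk], m[pivot] = m[pivot], m[rk]
        for r in range(rk + 1, len(m)):
            factor = m[r][col] / m[rk][col]
            m[r] = [x - factor * y for x, y in zip(m[r], m[rk])]
        rk += 1
    return rk


kappa, n_classes = 3, 2
lam = kappa * n_classes
pi_r = [F(1, lam)] * lam
# Hypothetical internal-edge matrix: positive entries, rows summing to 1.
M_rf = [[F(2 if u == v else 1, lam + 1) for v in range(lam)] for u in range(lam)]
# Pendant matrix (I I)^T: hidden state u shows base u mod kappa.
M_0 = [[F(int(u % kappa == i)) for i in range(kappa)] for u in range(lam)]
P = quartet_tensor(pi_r, M_rf, [M_0] * 4, kappa)
ranks = {s: rank(flatten(P, kappa, s)) for s in ("ab|cd", "ac|bd", "ad|bc")}
print(ranks)
```

Only the flattening matching the true split `ab|cd` has rank at most $\lambda$, so all of its $(\lambda + 1)$-minors vanish, while the other two flattenings have full rank $\kappa^2$ for these parameters.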

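Theorem 2. Suppose $\lambda < \kappa^2$. With $P$ denoting a $\kappa \times \kappa \times \kappa \times \kappa$ tensor
of indeterminates, let $S_1$, $S_2$, and $S_3$ denote the sets of $(\lambda + 1)$-minors
of the $\kappa^2 \times \kappa^2$ matrices $\mathrm{Flat}_{ab|cd}(P)$, $\mathrm{Flat}_{ac|bd}(P)$, and $\mathrm{Flat}_{ad|bc}(P)$,
respectively. Let the varieties $Y_i = V(S_i) \subset \mathbb{C}^{\kappa^4}$ be their zero sets. Then

(i) $V_{T_i} \subseteq Y_j$ if and only if $i = j$.

(ii) If $P \in V_{T_1} \cup V_{T_2} \cup V_{T_3}$ and $P \in Y_i \setminus (Y_j \cup Y_k)$ for distinct $i, j, k$,
then $P \in V_{T_i} \setminus (V_{T_j} \cup V_{T_k})$.

(iii) For distinct $i, j, k$, let $X_i = \phi_{T_i}^{-1}(Y_j \cup Y_k) \subset \mathbb{C}^M$. Then $X_i$ is a
proper algebraic subvariety of the complex parameter space for
the 4-taxon tree $T_i$, and for any parameters $s \in \mathbb{C}^M \setminus X_i$ the tree
$T_i$ is identifiable from the joint distribution tensor $P = \phi_{T_i}(s)$
via the vanishing of the polynomials in $S_i$.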

Proof. Throughout, we may assume $i = 1$, $j = 2$, $k = 3$.

To establish (i), for any parameters $s \in \mathbb{C}^M$ on $T_1$, let $Q = Q(s)$,
$P = P(s) = \phi_{T_1}(s)$, where $Q$ and $P$ are the tensors described above.
Now $\mathrm{Flat}_{ab|cd}(Q)$ is a matrix of all zeros except for a single $\lambda \times \lambda$
submatrix whose entries are those of $A = \mathrm{diag}(\pi_r) M_{rf}$. Thus

$$\mathrm{rank}(\mathrm{Flat}_{ab|cd}(Q)) = \mathrm{rank}(A) \le \lambda.$$

By Equation (2) this implies $\mathrm{rank}(\mathrm{Flat}_{ab|cd}(P)) \le \lambda$. Thus
$\phi_{T_1}(\mathbb{C}^M) \subseteq Y_1$, and hence $V_{T_1} \subseteq Y_1$.

We now show that $V_{T_1}$ is contained in neither $Y_2$ nor $Y_3$, by finding an
$s \in \mathbb{C}^M$ with $\phi_{T_1}(s) \notin Y_2 \cup Y_3$. To pick such an $s$, we choose each of
$M_{ra}, M_{rb}, M_{fc}, M_{fd}$ to have the block form

$$\begin{pmatrix} I_{\kappa \times \kappa} \\ 0_{(\lambda - \kappa) \times \kappa} \end{pmatrix}.$$

Then the Kronecker products $N_{ac}, N_{bd}, N_{ad}, N_{bc}$ all have, after a
suitable ordering of rows, the block form

$$\begin{pmatrix} I_{\kappa^2 \times \kappa^2} \\ 0_{(\lambda^2 - \kappa^2) \times \kappa^2} \end{pmatrix}.$$

Now choosing $\pi_r$ and $M_{rf}$ to have all positive entries, for instance,
ensures that all entries of $A$ are non-zero. Since $\mathrm{Flat}_{ac|bd}(Q)$ is a matrix
of all zeros, except the entries of $A$ which appear in the $((i,j),(i,j))$
positions, $\mathrm{Flat}_{ac|bd}(Q)$ is thus a diagonal matrix of rank $\lambda^2$. Similarly
$\mathrm{Flat}_{ad|bc}(Q)$ is diagonal with full rank $\lambda^2$. Thus

$$\mathrm{Flat}_{ac|bd}(P) = N_{ac}^T\, \mathrm{Flat}_{ac|bd}(Q)\, N_{bd},$$

$$\mathrm{Flat}_{ad|bc}(P) = N_{ad}^T\, \mathrm{Flat}_{ad|bc}(Q)\, N_{bc}$$

both have rank $\kappa^2 > \lambda$ due to the particular form of $N_{ac}, N_{bd}, N_{ad}, N_{bc}$.
Thus $\phi_{T_1}(s) \notin Y_2 \cup Y_3$.

Statement (ii) follows immediately from (i).

For (iii), note that the existence of the point $s$, constructed above,
with $\phi_{T_1}(s) \notin Y_2 \cup Y_3$ shows $X_1$ is a proper subset of $\mathbb{C}^M$. That it
is an algebraic variety follows from its definition as the zero set of all
polynomials of the form $f \circ \phi_{T_1}$ where $f$ vanishes on $Y_2 \cup Y_3$. □

Remark. The set $X_i$ should be thought of as the set of 'bad'
parameters for the model on $T_i$, for which this approach is unable to identify
the tree topology from the resulting joint distribution. It is important
that $X_i$ be a proper subvariety of the parameter space since this
immediately implies its dimension is smaller than that of the parameter
space. If one restricts attention from complex to real parameters, or
even to stochastic parameters, the points in $X_i$ still form a set of lower
dimension than the full parameter space. Thus for 'most' stochastic
parameters, the topology is identifiable from the joint distribution.

Corollary 3. The $n$-taxon bifurcating tree topology is identifiable for
generic parameters of the $(\lambda, \kappa)$-state general Markov model when
$\lambda < \kappa^2$. That is, for each $n$-leaf tree $T$, there exists a proper subvariety $X_T$
of the complex parameter space $\mathbb{C}^M$ such that the tree topology is
identifiable from the joint distribution arising from any parameter choice
$s \in \mathbb{C}^M \setminus X_T$ via the vanishing of certain explicit polynomials (to be
described below).

Proof. As is well known [SS03], to identify the topology of an $n$-leaf
tree $T$, it is enough to identify the topology of each induced quartet
tree relating four leaves of $T$.

Let $\mathbb{C}^M$ be the complex parameter space for the tree $T$, and let $\mathcal{Q}$
denote the collection of all 4-leaf trees induced by $T$. For each $T' \in \mathcal{Q}$, the
parameter space for $T'$ is $\mathbb{C}^{M'}$ and we have the following commutative
diagram of polynomial maps:

$$\begin{array}{ccc}
\mathbb{C}^M & \xrightarrow{\ \phi_T\ } & \mathbb{C}^{\kappa^n} \\
\big\downarrow {\scriptstyle \alpha_{T'}} & & \big\downarrow {\scriptstyle \mu_{T'}} \\
\mathbb{C}^{M'} & \xrightarrow{\ \phi_{T'}\ } & \mathbb{C}^{\kappa^4}
\end{array}$$

The map $\alpha_{T'}$ can be explicitly given by multiplication of matrix
parameters for $T$ to obtain matrix parameters for $T'$, once a consistent
choice of roots for $T'$ and $T$ is made. The map $\mu_{T'}$ is a marginalization
map on tensors, where we sum over all but the 4 indices corresponding
to leaves of $T'$.

For any $T' \in \mathcal{Q}$, identify its leaves with labels $a, b, c, d$ so that $T'$ is
identified with $T_1$ of Theorem 2. Letting $X_1$, $Y_1$ be the varieties defined
in that theorem, consider varieties

$$X_T = \bigcup_{T' \in \mathcal{Q}} \alpha_{T'}^{-1}(X_1), \qquad Y_T = \bigcap_{T' \in \mathcal{Q}} \mu_{T'}^{-1}(Y_1).$$

For any parameters $s \in \mathbb{C}^M \setminus X_T$, $\phi_T(s) \in Y_T$, and all 4-leaf
induced tree topologies are identifiable by the vanishing of the
polynomials defining $Y_T$. (An explicit set of polynomials defining $Y_T$
can be taken to be the composition of the polynomials in $S_1$ of Theorem
2 with the marginalization maps $\mu_{T'}$.)

It only remains to show that $X_T$ is a proper subvariety of $\mathbb{C}^M$. But it
is easy to see that each of the maps $\alpha_{T'}$ is surjective. Therefore since
$X_1 \subsetneq \mathbb{C}^{M'}$, for each $T' \in \mathcal{Q}$ we find $\alpha_{T'}^{-1}(X_1) \subsetneq \mathbb{C}^M$. Since $X_T$ is a finite
union of proper subvarieties of $\mathbb{C}^M$, we obtain $X_T \subsetneq \mathbb{C}^M$. □
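
The marginalization maps $\mu_{T'}$ used in the proof are simple to state in code. The following minimal Python sketch represents a joint distribution on $n$ leaves as a dictionary keyed by state patterns and sums out every leaf not in the chosen quartet. The function name `marginalize`, the `keep` argument, and the toy product distribution are hypothetical, introduced only for illustration; leaves and states are indexed from 0.

```python
from collections import defaultdict
from fractions import Fraction as F
from itertools import product


def marginalize(P, keep):
    """Sum a pattern-indexed tensor P over all leaves except those listed in `keep`."""
    marginal = defaultdict(F)   # Fraction() is 0, so missing keys start at zero
    for pattern, prob in P.items():
        marginal[tuple(pattern[i] for i in keep)] += prob
    return dict(marginal)


# Toy 5-leaf distribution with kappa = 2: independent leaves, each with
# probabilities (1/3, 2/3). Used here only to exercise the map.
leaf = [F(1, 3), F(2, 3)]
n = 5
P = {}
for pattern in product(range(2), repeat=n):
    prob = F(1)
    for state in pattern:
        prob *= leaf[state]
    P[pattern] = prob

# Marginal on the quartet formed by leaves 0, 1, 3, 4 (leaf 2 is summed out).
P4 = marginalize(P, keep=(0, 1, 3, 4))
print(len(P4), sum(P4.values()))
```

Composing the minors in $S_1$ with such a map, one quartet at a time, gives polynomials in the entries of the full $n$-leaf tensor, as described in the proof.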

We note that the idea of using rank conditions on flattenings of a data
tensor to identify tree topology also appears in recent independent work
of Eriksson [Eri05], where the SVD is used to give a novel algorithm
for tree construction. That paper deals only with the general Markov
model ($\lambda = \kappa$), and takes a slightly different approach to identifying
splits for a tree without focusing on quartets.

Specializing our result to $\lambda = \kappa$, we recover the following result
proved previously by Steel [Ste94] using the log-det distance, and then
reproved by Eriksson.

Corollary 4. The tree topology is identifiable for generic parameters
in the $\kappa$-state general Markov model.

5. IDENTIFIABILITY OF TREE TOPOLOGY FOR ANALYTIC
$(\lambda, \kappa)$-STATE MODELS

To deduce identifiability of the tree topology for analytic
$(\lambda, \kappa)$-state submodels from Theorem 2 or its extension to
larger trees requires a little additional work, since, a priori, it is
possible that the restricted parameters are not sufficiently generic to
preserve identifiability.

Fix an analytic $(\lambda, \kappa)$-state model, and let
$\psi : U \to \mathbb{C}^{N}$ denote its Markov map, giving parameters
for the general $(\lambda, \kappa)$-state model in terms of parameters
for the analytic model.

We first consider the identifiability of a 4-leaf tree topology, so let
$T_1 = T_{ab|cd}$. For parameters $u$ of the analytic model, we need
only show that points $\psi(u)$ are generically not in the variety
$X_1$ of Theorem 2.

With $I(X_1)$ denoting the ideal of all polynomials vanishing on $X_1$,
consider the set

$$
\widetilde{X}_1 = \psi^{-1}(X_1)
  = \{u : f \circ \psi(u) = 0 \text{ for all } f \in I(X_1)\} \subseteq U.
$$

Since $\psi$ is an analytic map, so is $f \circ \psi$ for each
polynomial $f$, and so $\widetilde{X}_1$ is an analytic subvariety of
$U$. Thus if we establish that $\widetilde{X}_1$ is a proper subvariety
of $U$, then generic points in $U$ are mapped by $\psi$ to generic
points (off of $X_1$) in $\mathbb{C}^{N}$. This establishes the
following:

Lemma 5. For $\lambda < \kappa^2$, consider an analytic
$(\lambda, \kappa)$-state model on a 4-leaf tree, with parameter space
$U$ and Markov map $\psi$. If there is a single choice of parameters
$u \in U$ such that $\psi(u) \notin X_1$, then the 4-leaf tree topology
is identifiable for generic parameters.

For ease of application to the specific models listed earlier, we
deduce a weaker form of this.

Lemma 6. Consider an analytic $\kappa$-base, $m$-class model on a
4-leaf tree. Suppose $m < \kappa$, and there is at least one choice of
allowable parameters for which

(i) the Markov matrices for pendant edges are of the form
$M_e = M_0 = (I_{\kappa \times \kappa} \; I_{\kappa \times \kappa} \;
\cdots \; I_{\kappa \times \kappa})^T$, and

(ii) if $\pi_r$ is the root distribution and $M_e$ the
$\lambda \times \lambda$ Markov matrix assigned to the internal edge of
the tree, then the $\kappa \times \kappa$ matrix

$$N = M_0^T \operatorname{diag}(\pi_r) M_e M_0$$

has at least $m\kappa + 1$ non-zero entries.

Then the tree topology is identifiable for generic parameters of the
model.

Before proving this, we note that condition (i) means that no base 
changes occur on pendant edges, though class information is hidden. 
In condition (ii), N represents a joint distribution of bases, without 
class information, at the two internal nodes of the tree. 

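Proof. Since $m < \kappa$, we have $\lambda = m\kappa < \kappa^2$, and
Lemma 5 applies.

For a parameter choice on the tree $T_1$ as described in the statement
of the lemma, the joint distribution of bases at the leaves
$a, b, c, d$ is given by $P$, where

$$
P(i, j, k, l) =
\begin{cases}
N(i, k) & \text{if } i = j \text{ and } k = l, \\
0 & \text{otherwise.}
\end{cases}
$$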

Therefore the matrices $\operatorname{Flat}_{ac|bd}(P)$ and
$\operatorname{Flat}_{ad|bc}(P)$ are diagonal with at least
$m\kappa + 1$ non-zero entries. Hence they have rank at least
$m\kappa + 1$. This shows the parameters do not lie in $X_1$, and so
the topology is identifiable for generic parameters. □


We now obtain the result that provided our original motivation for 
this work. 

Corollary 7. For the covarion model of Tuffley-Steel, if
$\kappa \ge 3$ the 4-leaf tree topology is identifiable for generic
parameters.

Proof. This model is an analytic $\kappa$-base, 2-class model, and so
we need $\kappa \ge 3$ to apply Lemma 6.

For any $R = (R_{ij}), \pi, s_1, s_2$ with all
$s_i, \pi_i, R_{ij} > 0$ for $i \ne j$, the matrix
$\operatorname{diag}(\pi_r) \exp(Q t_e)$ has all positive entries as
long as $t_e > 0$, so then the matrix $N$ has all positive entries.
Picking such parameters, with $t_e > 0$ for the internal edge of the
tree and $t_e = 0$ for all pendant edges, Lemma 6 gives the result. □
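The positivity claim in this proof can be illustrated numerically for
$\kappa = 3$. The sketch below assumes the usual block form of the
covarion rate matrix, $Q = \begin{pmatrix} R - s_1 I & s_1 I \\ s_2 I &
-s_2 I \end{pmatrix}$, and a root distribution built from $\pi$ and the
class proportions $s_2/(s_1+s_2)$, $s_1/(s_1+s_2)$; both are
assumptions of this example, as are the numerical values and the
helper names `matmul` and `expm`. The matrix exponential is only
approximated, by a truncated Taylor series with scaling and squaring.

```python
def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def expm(Q, t, squarings=10, terms=20):
    """Approximate exp(Q t) by a truncated Taylor series with scaling and squaring."""
    n = len(Q)
    A = [[q * t / 2**squarings for q in row] for row in Q]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, A)]
        result = [[r + x for r, x in zip(rr, tr)] for rr, tr in zip(result, term)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result


kappa = 3
s1, s2 = 0.5, 0.3                      # hypothetical switching rates
pi = [1.0 / kappa] * kappa             # hypothetical base frequencies
# Hypothetical rate matrix R for the "on" class: off-diagonal rates 1, rows sum to 0.
R = [[-2.0 if i == j else 1.0 for j in range(kappa)] for i in range(kappa)]

# Assumed block form Q = [[R - s1*I, s1*I], [s2*I, -s2*I]] on states (class, base).
Q = [[0.0] * (2 * kappa) for _ in range(2 * kappa)]
for i in range(kappa):
    for j in range(kappa):
        Q[i][j] = R[i][j] - (s1 if i == j else 0.0)
    Q[i][kappa + i] = s1
    Q[kappa + i][i] = s2
    Q[kappa + i][kappa + i] = -s2

# Assumed root distribution: class proportions s2/(s1+s2), s1/(s1+s2) times pi.
pi_r = [s2 / (s1 + s2) * p for p in pi] + [s1 / (s1 + s2) * p for p in pi]

E = expm(Q, t=0.5)                     # internal edge with t_e > 0
# N = M_0^T diag(pi_r) exp(Q t_e) M_0: sum over the hidden classes at both ends.
N = [[sum(pi_r[c * kappa + i] * E[c * kappa + i][d * kappa + j]
          for c in range(2) for d in range(2))
      for j in range(kappa)] for i in range(kappa)]

positive_entries = sum(1 for row in N for x in row if x > 0)
```

With these values every entry of $N$ is positive, so $N$ has
$\kappa^2 = 9$ non-zero entries, more than the $2\kappa + 1 = 7$ that
Lemma 6 asks for when $m = 2$.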

This result includes the $\kappa = 4, 20$ cases, which apply to DNA and
protein sequences. Note, however, that the identifiability of the tree
topology for the $\kappa = 2$ covarion model remains an open question.

Finally, the result extends to trees with more than 4 leaves, by an
argument analogous to the one used above to pass from 4-leaf trees to
larger trees for the general model.

Corollary 8. For the covarion model of Tuffley-Steel, if
$\kappa \ge 3$ then bifurcating tree topologies are identifiable for
generic parameters.

Though we omit the details, one similarly sees that tree topologies
for the 4-base, $m$-class covarion model SSRV of [Gal01] are generically
identifiable provided $m < 4$. Note that the implementation of the SSRV
in inference software described in that paper actually had $m = 4$, a
case not covered by our theorem. It would of course be desirable to
prove identifiability for that case, and larger $m$, as well.

Finally, we can apply this approach to non-covarion rate-variation 
models with a finite number of rate classes. As an example, we give 
the following result. 

Corollary 9. For the GM+GM+$\cdots$+GM model, with $\kappa$ states and
$m$ classes where $m < \kappa$, bifurcating tree topologies are
identifiable for generic parameters. In particular, when $\kappa = 4$,
the tree topology is generically identifiable for the GM+GM+GM model.

Proof. For the 4-leaf tree, consider any parameter choice where no
substitutions occur on pendant edges in any of the classes, the root
distribution has all positive entries, all Markov matrix entries are
non-negative, and for at least one class the $\kappa \times \kappa$
Markov matrix for that class on the internal edge has at least
$m\kappa + 1$ positive entries. Then apply Lemma 6.

An argument analogous to the one used earlier to pass from 4-leaf trees
to larger trees extends the result to trees with more leaves. □
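To see how condition (ii) of Lemma 6 can be met by a mixture with no
switching between classes, the following sketch builds $N$ for
$\kappa = 4$ and $m = 3$ and counts its non-zero entries. It assumes
that, for such a mixture, $N(i,j) = \sum_c w_c \, \pi_c(i) \, M_c(i,j)$,
where $w_c$ are class proportions, $\pi_c$ per-class root distributions
and $M_c$ per-class internal-edge matrices; this formula, the variable
names and the numerical values are illustrative assumptions.

```python
kappa, m = 4, 3

# Hypothetical parameters: class proportions, per-class root distributions
# (all positive), and one internal-edge Markov matrix per class.
weights = [0.5, 0.3, 0.2]
root = [[0.25] * kappa for _ in range(m)]
identity = [[float(i == j) for j in range(kappa)] for i in range(kappa)]
dense = [[0.25] * kappa for _ in range(kappa)]   # all 16 entries positive
internal = [dense, identity, identity]

# N(i, j) = sum over classes c of w_c * pi_c(i) * M_c(i, j): the joint
# distribution of bases at the two internal nodes, with class labels forgotten.
N = [[sum(weights[c] * root[c][i] * internal[c][i][j] for c in range(m))
      for j in range(kappa)] for i in range(kappa)]

nonzero = sum(1 for row in N for x in row if x > 0)
threshold = m * kappa + 1   # number of non-zero entries required by Lemma 6 (ii)
```

Here `nonzero` is 16 and `threshold` is 13, so this parameter choice
satisfies condition (ii); such a choice is only possible because
$m < \kappa$ guarantees $m\kappa + 1 \le \kappa^2$.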

Similarly, for the GTR+rate-classes model we obtain generic
identifiability of tree topology provided the number of classes $m$ is
less than the number of bases $\kappa$. Note that while previous
results on identifiability for this model [WS97, Rog01] have allowed a
known continuous distribution of rates, they have also assumed a common
rate matrix for all classes. Our result holds even for a model in which
different classes have unrelated GTR rate matrices.

Finally, we note that this approach proves generic identifiability of
tree topologies for the GM+I model when $\kappa \ge 3$. However, for
this particular model we will take a different approach in another
paper [AR05a], obtaining identifiability for $\kappa \ge 2$ as well as
some interesting explicit formulas for recovering proportions of
invariable sites, and identifying other numerical parameters as well.